Following the tutorial at:
In [24]:
import pandas as pd
In [25]:
# There are two data structures in pandas, Series and DataFrames
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
In [26]:
pd.DataFrame({"City Name": city_names, "Population": population})
Out[26]:
In [27]:
# importing an existing csv file into DataFrame
california_housing_dataframe = pd.read_csv(
    "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv",
    sep=","
)
In [28]:
california_housing_dataframe.shape
Out[28]:
In [29]:
california_housing_dataframe.head()
Out[29]:
In [30]:
california_housing_dataframe.hist('housing_median_age')
Out[30]:
In [31]:
cities = pd.DataFrame({'City Name': city_names, 'Population': population})
print(type(cities['City Name']))
cities['City Name']
Out[31]:
In [32]:
print(type(cities["City Name"][1]))
cities["City Name"][1]
Out[32]:
In [33]:
print(type(cities[0:2]))
cities[0:2]
Out[33]:
In [36]:
population / 1000
Out[36]:
In [37]:
import numpy as np
np.log(population)
Out[37]:
In [40]:
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities
Out[40]:
In [39]:
population.apply(lambda val: val > 1000000)
Out[39]:
Modify the cities table by adding a new boolean column that is True if and only if both of the following are True: the city is named after a saint, and the city has an area greater than 50 square miles.
Note: Boolean Series are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing logical and, use & instead of and.
Hint: "San" in Spanish means "saint."
In [46]:
cities['is saint and wide'] = (
    (cities['Area square miles'] > 50)
    & cities['City Name'].apply(lambda name: name.startswith("San"))
)
cities
Out[46]:
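To illustrate the note about bitwise operators, here is a small sketch that rebuilds the two conditions as separate boolean Series and combines them. The variable names (is_saint, is_wide, and_failed) are made up for this example; the data is the three-city table used above.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
areas = pd.Series([46.87, 176.53, 97.92])

is_saint = city_names.str.startswith('San')  # element-wise string test
is_wide = areas > 50                         # element-wise comparison

# & combines the two masks element by element.
both = is_saint & is_wide
print(both.tolist())  # [False, True, False]

# `and` asks for one truth value for the whole Series, which pandas refuses to guess.
and_failed = False
try:
    is_saint and is_wide
except ValueError as err:
    and_failed = True
    print("ValueError:", err)
```

The parentheses around each comparison matter when the conditions are written inline, because & binds more tightly than > in Python.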
Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.
By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.
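A minimal sketch of what "stable" means here, assuming the same population Series as above: sorting moves the rows, but each value keeps the index label it was given at construction. The name by_size is made up for this example.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])
print(population.index.tolist())  # [0, 1, 2]

# Sorting reorders the data; the labels travel with their values.
by_size = population.sort_values()
print(by_size.index.tolist())  # [2, 0, 1]

# Label-based lookup still finds the same value as before the sort.
print(by_size.loc[1])  # 1015785
```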
In [47]:
city_names.index
Out[47]:
In [48]:
cities.index
Out[48]:
In [50]:
cities.reindex([2, 0, 1])
Out[50]:
Reindexing is a convenient way to shuffle (randomize) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy's random.permutation function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling reindex with this shuffled array reorders the DataFrame rows in the same way.
In [52]:
cities.reindex(np.random.permutation(cities.index))
Out[52]:
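The sketch below breaks the shuffle into two steps so the intermediate array can be inspected. It assumes a cut-down version of the cities table with only two columns; shuffled_labels and shuffled are names introduced for this example. Note that np.random.permutation returns a new array, and reindex returns a new DataFrame, so cities itself is not modified.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# A new array holding the same labels in random order.
shuffled_labels = np.random.permutation(cities.index)

# Rows are looked up by label, so each row keeps its own data.
shuffled = cities.reindex(shuffled_labels)

print(shuffled_labels)
print(cities.index.tolist())  # original index is untouched: [0, 1, 2]
print(shuffled)
```

To keep the shuffled order, assign the result back, for example cities = cities.reindex(...).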
In [53]:
cities.reindex([4, 2, 1, 3, 0])
Out[53]:
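The call above passes labels 3 and 4, which are not in the original index. As a small sketch, assuming a two-column version of the cities table, reindex does not raise an error in this case: it adds a row for each missing label and fills it with NaN. The name expanded is made up for this example.

```python
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

expanded = cities.reindex([4, 2, 1, 3, 0])

print(expanded.index.tolist())  # [4, 2, 1, 3, 0]

# Labels 4 and 3 had no matching row, so their cells are missing values.
print(expanded['City Name'].isna().tolist())  # [True, False, False, True, False]
```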